This report explores the Housing Dataset through an exploratory data analysis approach. The dataset contains detailed information on housing attributes, neighborhood characteristics, zoning classifications, and sale conditions. Our initial work has focused on using visualizations—such as the interactive correlation matrix—to identify trends and correlations among key variables that could potentially influence sale prices.
The ultimate goal is to develop a predictive model for house prices. Although the current analysis lays the groundwork by uncovering relationships within the data, the exact approach for building the predictive model is still under investigation. Future work will involve further feature engineering, model selection, and validation to refine the prediction strategy.
This exploratory phase not only provides valuable insights into the dataset but also sets the stage for more advanced predictive modeling efforts.
To build an effective predictive model, it’s essential to begin with
a thoughtful selection of features. Our strategy involves first
splitting the dataset into two distinct groups: numeric
and categorical variables. This separation enables us
to apply methods tailored to each type of data. Importantly, we
recognize that some variables stored as numbers actually represent
categorical information (for example, MSSubClass,
OverallQual, OverallCond, MoSold,
YrSold, and BedroomAbvGr). These
“forced categorical” variables are removed from our
list of continuous, numeric features to ensure clarity in subsequent
analyses.
To identify the key numeric predictors, we first extracted all columns with numeric data types. Not every numeric column is a continuous measurement, so we removed the forced categorical variables listed above (MSSubClass, OverallQual, OverallCond, MoSold, YrSold, and BedroomAbvGr). We then imputed missing values in the remaining columns with each variable’s median. From this cleaned numeric set we computed the correlation matrix and generated an interactive heatmap. The heatmap showed that the initial pool was too large: many numeric features correlate only weakly with SalePrice. We therefore kept the variables with the strongest correlations, giving a smaller and potentially more informative set of numeric predictors.
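The steps in this paragraph can be sketched in a few lines. The report does not show its own code, so the following is a minimal Python illustration using only the standard library; the four rows of data and all function names are hypothetical, and only the list of forced categorical variables comes from the text.

```python
from math import sqrt
from statistics import median

# Toy rows standing in for the housing data (values are made up).
rows = [
    {"GrLivArea": 1500, "LotFrontage": 60, "MSSubClass": 60, "Neighborhood": "A", "SalePrice": 200000},
    {"GrLivArea": 1200, "LotFrontage": None, "MSSubClass": 20, "Neighborhood": "B", "SalePrice": 150000},
    {"GrLivArea": 2000, "LotFrontage": 80, "MSSubClass": 60, "Neighborhood": "A", "SalePrice": 260000},
    {"GrLivArea": 900, "LotFrontage": 50, "MSSubClass": 30, "Neighborhood": "C", "SalePrice": 110000},
]

# Numeric-coded variables that the report treats as categorical.
FORCED_CATEGORICAL = {"MSSubClass", "OverallQual", "OverallCond",
                      "MoSold", "YrSold", "BedroomAbvGr"}


def numeric_columns(rows):
    """Columns whose non-missing values are all numbers, minus forced categoricals."""
    cols = []
    for name in rows[0]:
        values = [r[name] for r in rows if r[name] is not None]
        is_numeric = bool(values) and all(isinstance(v, (int, float)) for v in values)
        if is_numeric and name not in FORCED_CATEGORICAL:
            cols.append(name)
    return cols


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    med = median(v for v in values if v is not None)
    return [med if v is None else v for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlations_with_target(rows, target="SalePrice"):
    """Correlation of every continuous numeric column with the target, descending."""
    cols = numeric_columns(rows)
    data = {c: impute_median([r[c] for r in rows]) for c in cols}
    corr = {c: pearson(data[c], data[target]) for c in cols if c != target}
    return dict(sorted(corr.items(), key=lambda kv: kv[1], reverse=True))


for name, r in correlations_with_target(rows).items():
    print(f"{name}: {r:.3f}")
```

On the toy data, MSSubClass and Neighborhood are dropped before any correlation is computed, and the missing LotFrontage value is filled with the median of the three observed values.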
Pearson correlation of each remaining numeric variable with SalePrice, in descending order:

| Variable | Correlation with SalePrice |
|---|---|
| SalePrice | 1.00000000 |
| GrLivArea | 0.70862448 |
| GarageCars | 0.64040920 |
| GarageArea | 0.62343144 |
| TotalBsmtSF | 0.61358055 |
| X1stFlrSF | 0.60585218 |
| FullBath | 0.56066376 |
| TotRmsAbvGrd | 0.53372316 |
| YearBuilt | 0.52289733 |
| YearRemodAdd | 0.50710097 |
| MasVnrArea | 0.47261450 |
| Fireplaces | 0.46692884 |
| GarageYrBlt | 0.46675365 |
| BsmtFinSF1 | 0.38641981 |
| LotFrontage | 0.33477085 |
| WoodDeckSF | 0.32441344 |
| X2ndFlrSF | 0.31933380 |
| OpenPorchSF | 0.31585623 |
| HalfBath | 0.28410768 |
| LotArea | 0.26384335 |
| BsmtFullBath | 0.22712223 |
| BsmtUnfSF | 0.21447911 |
| ScreenPorch | 0.11144657 |
| PoolArea | 0.09240355 |
| X3SsnPorch | 0.04458367 |
| BsmtFinSF2 | -0.01137812 |
| BsmtHalfBath | -0.01684415 |
| MiscVal | -0.02118958 |
| Id | -0.02191672 |
| LowQualFinSF | -0.02560613 |
| EnclosedPorch | -0.12857796 |
| KitchenAbvGr | -0.13590737 |

| Rank | Variable | Description | Correlation with SalePrice |
|---|---|---|---|
| 1 | GrLivArea | Above ground living area (sq ft) | 0.71 (strong positive) |
| 2 | GarageCars | Number of garage spaces | 0.64 (strong positive) |
| 3 | GarageArea | Garage size (sq ft) | 0.62 (strong positive) |
| 4 | TotalBsmtSF | Total basement area (sq ft) | 0.61 (strong positive) |
| 5 | 1stFlrSF | First floor area (sq ft) | 0.61 (strong positive) |
| 6 | FullBath | Number of full bathrooms | 0.56 (moderate positive) |
| 7 | TotRmsAbvGrd | Total rooms above ground | 0.53 (moderate positive) |
| 8 | YearBuilt | Year the house was built | 0.52 (moderate positive) |
| 9 | YearRemodAdd | Year of last remodeling | 0.51 (moderate positive) |
Summary: The numeric features most strongly correlated with sale price are above-ground living area, garage capacity and area, and basement size. Bathroom count, total rooms, and construction and remodeling year are also positively correlated with price, but less strongly.
We identify influential categorical variables in five steps:

1. Gather nominal features. After separating out the continuous numeric variables, we compile an initial list of categorical features. This list includes the “forced categorical” variables (such as MSSubClass and OverallQual), which are stored as numbers but treated as categories.
2. Perform statistical tests. For each categorical feature, we fit a one-way ANOVA model (or an equivalent test) with SalePrice as the response to assess whether mean house prices differ across category levels. This yields a p-value indicating the significance of each feature’s effect on SalePrice.
3. Apply a significance threshold. We filter out any categorical variable whose p-value exceeds our chosen cutoff (0.05). Those that remain are considered to have a statistically significant relationship with SalePrice.
4. Check effect size and practical relevance. For the statistically significant variables, we examine additional metrics (such as effect size or summary statistics) to ensure that the relationship is both meaningful and practically relevant. Variables showing negligible impact or overly sparse categories may still be excluded.
5. Finalize key predictors. The result is a curated set of categorical features that consistently show a significant and practically relevant influence on housing prices. These final variables, such as Neighborhood, Exterior1st, and Foundation, form the basis for our subsequent modeling and interpretation.

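The ANOVA test and the effect-size check described above both come from the same sums of squares. The sketch below is a minimal Python illustration rather than the report's own code: the function name and the prices (in thousands) are hypothetical, and only the F statistic and η² are computed. Obtaining the p-value would additionally require the F distribution with (k − 1, n − k) degrees of freedom from a statistics library.

```python
def one_way_anova(groups):
    """One-way ANOVA on a dict mapping category level -> list of prices.

    Returns (F statistic, eta squared).
    """
    means = {level: sum(v) / len(v) for level, v in groups.items()}
    n = sum(len(v) for v in groups.values())  # total observations
    k = len(groups)                           # number of category levels
    grand_mean = sum(sum(v) for v in groups.values()) / n

    # Variation of the level means around the grand mean.
    ss_between = sum(len(v) * (means[level] - grand_mean) ** 2
                     for level, v in groups.items())
    # Variation of individual prices around their own level mean.
    ss_within = sum((x - means[level]) ** 2
                    for level, v in groups.items() for x in v)

    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


# Hypothetical sale prices (in $1000s) grouped by Foundation level.
foundation_prices = {
    "PConc": [200, 220, 240],
    "CBlock": [140, 150, 160],
    "BrkTil": [100, 120, 110],
}

f_stat, eta_squared = one_way_anova(foundation_prices)
print(f"F = {f_stat:.1f}, eta^2 = {eta_squared:.3f}")
```

η² is the share of total price variation explained by the grouping, which is why it can serve as an effect-size filter (η² > 0.1) alongside the significance test.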
Significant variables (ANOVA p < 0.05), 27 in total: MSSubClass, MSZoning, LotShape, LotConfig, Neighborhood, Condition1, BldgType, HouseStyle, OverallQual, OverallCond, RoofStyle, Exterior1st, Exterior2nd, MasVnrType, ExterCond, Foundation, BsmtExposure, BsmtFinType1, CentralAir, Electrical, BedroomAbvGr, FireplaceQu, GarageType, GarageFinish, PavedDrive, SaleType, SaleCondition.

High-effect variables (η² > 0.1), 9 in total: Neighborhood, OverallQual, Exterior1st, Exterior2nd, MasVnrType, Foundation, BsmtFinType1, GarageType, GarageFinish.

| Variable | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| Neighborhood | 50.947490 | 24 | 1.085338 |
| OverallQual | 2.738410 | 1 | 1.654814 |
| Exterior1st | 9.172182 | 14 | 1.082366 |
| MasVnrType | 2.258565 | 3 | 1.145439 |
| Foundation | 6.129657 | 5 | 1.198791 |
| BsmtFinType1 | 2.704414 | 5 | 1.104606 |
| GarageType | 2.591109 | 5 | 1.099888 |
| GarageFinish | 2.604354 | 2 | 1.270355 |

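The last column of the output above is the GVIF raised to the power 1/(2·Df), which puts terms with different degrees of freedom on a comparable scale. As a quick check, the Python sketch below recomputes that column from the reported GVIF and Df values; the function name is our own, and the numbers are copied from the output above.

```python
def adjusted_gvif(gvif, df):
    """Scale a generalized VIF by its degrees of freedom: GVIF^(1/(2*Df))."""
    return gvif ** (1.0 / (2 * df))


# (GVIF, Df, reported adjusted value), copied from the output above.
reported = {
    "Neighborhood": (50.947490, 24, 1.085338),
    "OverallQual": (2.738410, 1, 1.654814),
    "Exterior1st": (9.172182, 14, 1.082366),
    "MasVnrType": (2.258565, 3, 1.145439),
    "Foundation": (6.129657, 5, 1.198791),
    "BsmtFinType1": (2.704414, 5, 1.104606),
    "GarageType": (2.591109, 5, 1.099888),
    "GarageFinish": (2.604354, 2, 1.270355),
}

for name, (gvif, df, expected) in reported.items():
    value = adjusted_gvif(gvif, df)
    print(f"{name:<13} {value:.6f} (reported {expected:.6f})")
```

This is why Neighborhood, despite a raw GVIF above 50, is not flagged: with 24 degrees of freedom its adjusted value is about 1.09, well under the threshold of 2.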
We selected the following categorical variables as key predictors for housing prices based on their statistical significance, effect size, and practical relevance. These variables were chosen through ANOVA tests (p < 0.05), effect size thresholds (η² > 0.1), and VIF checks to ensure no severe multicollinearity (adjusted GVIF^(1/(2*Df)) < 2).
| Variable | η² | Adjusted GVIF^(1/(2*Df)) | Practical Relevance |
|---|---|---|---|
| Neighborhood | 0.62 | 1.09 | Location significantly impacts prices due to factors like school districts and amenities. |
| OverallQual | 0.75 | 1.65 | Overall material/finish quality is the strongest single predictor of home value. |
| Exterior1st | 0.15 | 1.08 | Exterior covering material (e.g., brick, vinyl) affects curb appeal and durability. |
| MasVnrType | 0.12 | 1.15 | Masonry veneer type (e.g., stone, brick) contributes to structural aesthetics. |
| Foundation | 0.18 | 1.20 | Foundation type (e.g., poured concrete) impacts longevity and maintenance costs. |
| BsmtFinType1 | 0.11 | 1.10 | Quality of finished basement areas adds functional living space value. |
| GarageType | 0.13 | 1.10 | Garage configuration (e.g., attached vs. detached) affects usability and convenience. |
| GarageFinish | 0.19 | 1.27 | Finished garages increase property functionality and resale value. |
Utilities and Street were excluded for low variance (>95% single-category dominance).

| Rank | Variable | Type | η² | Adjusted GVIF^(1/(2*Df)) | Correlation with SalePrice | Practical Relevance |
|---|---|---|---|---|---|---|
| 1 | OverallQual | Categorical | 0.75 | 1.65 | High (Strong Positive) | Overall material/finish quality is the strongest single predictor of home value. |
| 2 | GrLivArea | Numerical | - | - | High (Strong Positive) | Above ground living area (sq ft) is a major determinant of price. |
| 3 | TotalBsmtSF | Numerical | - | - | High (Strong Positive) | Total basement area adds significant usable space, impacting price. |
| 4 | Neighborhood | Categorical | 0.62 | 1.09 | - | Location significantly impacts prices due to factors like school districts and amenities. |
| 5 | GarageArea | Numerical | - | - | High (Strong Positive) | Larger garage size contributes to convenience and value. |
| 6 | GarageCars | Numerical | - | - | High (Strong Positive) | Number of garage spaces affects usability and desirability. |
| 7 | 1stFlrSF | Numerical | - | - | Moderate to High | First-floor size is linked to living comfort and value. |
| 8 | GarageFinish | Categorical | 0.19 | 1.27 | - | Finished garages increase property functionality and resale value. |
| 9 | Foundation | Categorical | 0.18 | 1.20 | - | Foundation type (e.g., poured concrete) impacts longevity and maintenance costs. |
| 10 | FullBath | Numerical | - | - | Moderate | Number of full bathrooms influences home value but is secondary to space. |
| 11 | TotRmsAbvGrd | Numerical | - | - | Moderate | Total rooms above ground can add value but depends on layout and design. |
| 12 | Exterior1st | Categorical | 0.15 | 1.08 | - | Exterior material (e.g., brick, vinyl) affects curb appeal and durability. |
| 13 | GarageType | Categorical | 0.13 | 1.10 | - | Garage configuration (attached vs. detached) affects usability and appeal. |
| 14 | MasVnrType | Categorical | 0.12 | 1.15 | - | Masonry veneer type (e.g., stone, brick) contributes to structural aesthetics. |
| 15 | BsmtFinType1 | Categorical | 0.11 | 1.10 | - | Quality of finished basement areas adds functional living space value. |
| 16 | YearBuilt | Numerical | - | - | Moderate | Newer homes typically have higher prices but with variability. |
| 17 | MoSold | Categorical (numeric-coded) | - | - | - | Month sold (MM) helps analyze seasonality and sales trends. |
| 18 | YrSold | Categorical (numeric-coded) | - | - | - | Year sold (YYYY) is useful for observing long-term market trends. |
To analyze seasonality and time-series trends, we also include MoSold and YrSold (ranks 17 and 18 in the table above).
In the figures below, we focus on the selected
variables that demonstrate a strong relationship with
SalePrice. Notably, we exclude both YrSold and
MoSold to emphasize the more impactful features in our
dataset.
This composite figure shows 16 subplots illustrating both
scatter plots (top row) and box plots
(middle and bottom rows) for the selected variables. The scatter plots
reveal how various numeric predictors (e.g., living area, basement area,
garage size) trend with
SalePrice (fitted by the red line),
while the box plots capture how different categorical or discrete
features (e.g., quality ratings, neighborhood, foundation type)
influence sale prices. By examining these subplots together, we can
identify which factors most strongly affect house prices and use that
insight to guide further analysis.
This line chart depicts the monthly average of house sale prices, providing a temporal perspective on how property values fluctuate across different months. The red points mark the mean price in each period, while the blue line highlights overall trends over time.
Figure: Neighborhood Median Sale Prices
This interactive map displays each neighborhood’s median sale
price using color-coded markers for different price brackets. Hover over
a marker to see additional property details (e.g., living area, basement
size, overall quality), offering a localized perspective on how housing
values vary across the city.
These radar plots allow for side-by-side comparisons of selected attributes—such as living area, basement size, overall quality, and more—across one or multiple neighborhoods. By toggling different neighborhoods on or off, you can visually contrast their strengths and weaknesses in each category, providing deeper insight into how these factors influence housing values.
This line chart tracks how the median sale price fluctuates across different months, offering insights into potential seasonal or cyclical patterns. Red points mark each month’s median price, while the connecting line highlights the overall trend throughout the year.
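The values plotted in such a chart are a grouped median. A minimal Python sketch of that aggregation follows, assuming each sale is reduced to a (MoSold, SalePrice) pair; the function name and the sample prices are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_price_by_month(sales):
    """Group (MoSold, SalePrice) pairs by month and take each month's median."""
    by_month = defaultdict(list)
    for month, price in sales:
        by_month[month].append(price)
    return {month: median(by_month[month]) for month in sorted(by_month)}


# Hypothetical sales: (month sold, sale price).
sales = [(1, 150000), (1, 170000), (1, 210000), (6, 180000), (6, 230000)]

for month, price in median_price_by_month(sales).items():
    print(f"month {month:2d}: median {price:,.0f}")
```

Using the median rather than the mean keeps a single very expensive sale from dominating a month with few transactions.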
This box plot groups sale prices by quarter (Q1 through Q4), illustrating how values fluctuate throughout the year. Each box captures the median and interquartile range, while outliers reflect unusually high or low transactions during that period.
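The quantities behind each box can be computed directly. The sketch below is illustrative only: it maps a month to its quarter and derives the quartiles, assuming the common 1.5 × IQR rule for flagging outliers (the report does not state which rule its plotting library applied). The function names and prices are hypothetical.

```python
from statistics import quantiles


def quarter_of(month):
    """Map a month number (1-12) to its quarter (1-4)."""
    return (month - 1) // 3 + 1


def box_stats(prices):
    """Quartiles, IQR, and points beyond the 1.5 * IQR fences (assumed rule)."""
    q1, med, q3 = quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [p for p in prices if p < lower or p > upper]
    return {"q1": q1, "median": med, "q3": q3, "iqr": iqr, "outliers": outliers}


# Hypothetical sale prices (in $1000s) for one quarter.
q2_prices = [100, 120, 130, 140, 150, 160, 170, 180, 400]
print(quarter_of(5), box_stats(q2_prices))
```

In this example the quartiles are 130, 150, and 170, so the upper fence sits at 230 and the sale at 400 is the only point flagged as an outlier.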
This heatmap compares median house prices across neighborhoods (rows) and years (columns). Brighter or warmer tones indicate higher sale prices, while cooler shades reflect more modest values. By scanning across rows and columns, you can quickly spot how different neighborhoods have evolved over time and identify periods of notably high or low market activity.
This line chart tracks the median house price each year from 2006 to 2010. After peaking around 2007, prices show a downward trajectory, highlighting broader market shifts and economic influences within this timeframe.
This line chart shows median sale prices from 2006 to 2010, with key
economic events labeled along the top. Spikes or dips in the trend line
may coincide with these milestones, which points to possible links with
the housing market, although timing alone does not establish cause and
effect.
Marked by the red dashed line, this plot shows median sale prices in the months before and after the Lehman Brothers bankruptcy. The aim is to detect any delayed impact on housing values following this significant financial collapse.
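An event-window view like this one can be built by re-indexing each sale by its distance in months from the event. The following Python sketch assumes sales are given as (YrSold, MoSold, SalePrice) tuples and uses September 2008, the month of the Lehman Brothers bankruptcy filing, as month zero. The function names, the six-month window, and the sample prices are hypothetical.

```python
from collections import defaultdict
from statistics import median

LEHMAN = (2008, 9)  # bankruptcy filing: 15 September 2008


def months_from_event(year, month, event=LEHMAN):
    """Signed number of months between a sale and the event month."""
    return (year - event[0]) * 12 + (month - event[1])


def event_window_medians(sales, event=LEHMAN, window=6):
    """Median price per month offset, restricted to +/- `window` months."""
    buckets = defaultdict(list)
    for year, month, price in sales:
        offset = months_from_event(year, month, event)
        if -window <= offset <= window:
            buckets[offset].append(price)
    return {offset: median(buckets[offset]) for offset in sorted(buckets)}


# Hypothetical sales: (year sold, month sold, sale price).
sales = [
    (2008, 6, 190000), (2008, 6, 210000),
    (2008, 9, 185000),
    (2008, 12, 170000), (2008, 12, 175000), (2008, 12, 160000),
    (2010, 1, 200000),  # outside the window, ignored
]
print(event_window_medians(sales))
```

The same function works for the other event plots by passing a different `event` tuple.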
Centered on the tax credit event (red dashed line), this chart
tracks median sale prices to see whether the introduction of the
incentive had an immediate or gradual influence on home values.
By aligning monthly median prices around the onset of the
subprime crisis, this plot illustrates how housing values behaved just
prior to—and in the aftermath of—this pivotal economic turning point.
The top row shows histograms of various
numeric features—such as GrLivArea,
TotalBsmtSF, GarageArea, and
X1stFlrSF—highlighting their frequency distributions. The
middle row covers discrete or ordinal attributes like
YearBuilt, OverallQual, and
TotRmsAbvGrd, providing insights into how homes are spread
across different quality ratings and room counts.
In the bottom row, pie charts break down the
proportions of categorical features (e.g., FullBath,
GarageFinish, MasVnrType), revealing which
categories dominate each variable. These visualizations help us
understand the overall composition of the dataset and guide decisions
about how best to handle each variable in further analysis.
This four-panel plot displays:

1. The raw time series data (top panel).
2. The seasonal component, highlighting recurring monthly fluctuations.
3. The trend component, illustrating the long-term direction in house prices.
4. The remainder (bottom panel), capturing short-term irregularities not explained by seasonality or trend.

Each component—Remainder,
Seasonal, and Trend—is shown in a
separate facet, with the x-axis representing time and the y-axis showing
each component’s magnitude. The seasonal curve exhibits
regular peaks and troughs over the year, the trend line
reveals how median sale prices evolve over time, and the
remainder indicates any short-term fluctuations once
the seasonal and trend effects have been removed.
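To make the three components concrete, the sketch below implements a classical additive decomposition in plain Python. This is a simpler stand-in, not necessarily the method used for the plots above (the report does not name its decomposition routine). It assumes a monthly series with an even period and at least two full cycles; the function name and the synthetic series are hypothetical.

```python
def decompose_additive(series, period=12):
    """Classical additive decomposition: series = trend + seasonal + remainder.

    Assumes an even `period` and at least two full cycles of data.
    Trend and remainder are None for the first and last period/2 points.
    """
    n = len(series)
    half = period // 2

    # Trend: centered moving average, end points of the window get half weight.
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period

    # Seasonal: average detrended value for each position in the cycle,
    # shifted so the seasonal effects sum to zero over one period.
    detrended = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            detrended[t % period].append(series[t] - trend[t])
    raw = [sum(d) / len(d) for d in detrended]
    offset = sum(raw) / period
    cycle = [r - offset for r in raw]
    seasonal = [cycle[t % period] for t in range(n)]

    # Remainder: what is left once trend and seasonality are removed.
    remainder = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, remainder


# Synthetic three-year monthly series: linear trend plus a fixed seasonal cycle.
pattern = [-3, -2, -1, 0, 1, 2, 3, 2, 1, 0, -1, -2]
series = [100 + 2 * t + pattern[t % 12] for t in range(36)]
trend, seasonal, remainder = decompose_additive(series)
print(trend[6], seasonal[:12], remainder[6])
```

Because the synthetic series contains no noise, the recovered remainder is zero (up to floating-point error) wherever the trend is defined; with real sale prices the remainder would hold the short-term irregularities.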
The panels below the diagonal show a scatterplot for each pair of
numeric features (e.g., GrLivArea vs. GarageArea), while the diagonal
panels display the univariate distribution of each variable (here,
density plots). Correlation coefficients in the upper panels summarize
the strength of each pairwise relationship, and the scatterplots help to
identify potential outliers, clusters, and overall patterns in the data.
This faceted box plot displays how the distribution of sale
prices changes with different overall quality ratings
(OverallQual) across various neighborhoods. Each facet
corresponds to a specific neighborhood, helping you quickly see whether
high-quality homes fetch significantly higher prices in certain areas
compared to others. Red points mark potential outliers, indicating
unusually high or low values within each category.
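Structure of the four-variable data frame (1,460 observations, all columns numeric):

| Variable | Type | First values |
|---|---|---|
| GrLivArea | num | 1710, 1262, 1786, 1717, 2198, ... |
| TotalBsmtSF | num | 856, 1262, 920, 756, 1145, ... |
| GarageArea | num | 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, ... |
| OverallQual | num | 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, ... |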